This paper shows how network graph embeddings can help to solve major open problems in natural language understanding. One illustrative problem in NLP is language variability, which happens when two sentences express the same meaning (or idea) with very different words. For instance, we may say almost interchangeably: “where is the nearest sushi restaurant?” or “can you please give me addresses of sushi places nearby?”. These two sentences share the same meaning but use different wording. This is the big challenge we are struggling with. In data science terms, it is an instance of a well-known problem called text similarity. Indeed, the sparse bag-of-words vectors of the two sentences share only one word (“sushi”), and consequently their cosine distance is close to 1. This is a terrible distance score because the two sentences have very similar meanings.
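To make the problem concrete, the following Python sketch builds raw word-count vectors for the two example sentences and computes their cosine distance. The tokenisation (lowercasing, keeping alphabetic runs only) and the helper names `bag_of_words` and `cosine_distance` are assumptions made for illustration; they are not the method developed in the rest of this paper.

```python
import math
import re
from collections import Counter

def bag_of_words(sentence):
    # Assumption: lowercase the text and keep alphabetic runs only.
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

def cosine_distance(a, b):
    # 1 - cosine similarity of two sparse word-count vectors.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

s1 = bag_of_words("Where is the nearest sushi restaurant?")
s2 = bag_of_words("Can you please give me addresses of sushi places nearby?")
distance = cosine_distance(s1, s2)
print(round(distance, 2))
```

The only shared token is “sushi”, so the distance stays close to 1 (about 0.87 with these raw counts) even though the two sentences mean the same thing.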
The first idea that would cross any data scientist’s mind is to use popular document embedding methods, such as Doc2Vec, averaged word2vec vectors, weighted averages of word2vec vectors (e.g. tf-idf weights) or RNN-based embeddings (e.g. deep LSTM networks), together with a similarity measure, to cope with this text similarity challenge.
As for us, we will tackle this text similarity challenge by implementing network graph embeddings, in the light of traditional word embedding techniques.
In very basic terms, word embeddings turn the text of a corpus into numerical vectors. Consequently, two different words that are semantically similar are close in terms of Euclidean distance in a given high-dimensional space. Words that have the same meaning have similar representations, i.e. very close numerical vectors.
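As a toy illustration of this idea, the Python sketch below compares Euclidean distances between three hand-written vectors. The vectors are hypothetical and invented purely for the example: real embeddings are learned from a corpus and usually have hundreds of dimensions.

```python
import math

# Hypothetical 3-dimensional vectors, invented for illustration only.
toy_embeddings = {
    "restaurant": [0.9, 0.8, 0.1],
    "places":     [0.8, 0.9, 0.2],
    "friday":     [0.1, 0.2, 0.9],
}

def euclidean(u, v):
    # Straight-line distance between two vectors of equal length.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

near = euclidean(toy_embeddings["restaurant"], toy_embeddings["places"])
far = euclidean(toy_embeddings["restaurant"], toy_embeddings["friday"])
print(near < far)
```

In this made-up space, “restaurant” sits much closer to “places” than to “friday”, which is the behaviour one hopes for from a trained embedding.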
Our intention is to implement some Natural Language Processing (NLP) within graph databases, and more specifically Neo4j. First, a few words about Neo4j, which is a graph database management system developed by Neo4j, Inc. The underlying concept is very simple: everything is stored in the form of either an edge, a node, or an attribute. Each node and edge can have any number of attributes. Both the nodes and the edges can be labelled.
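The data model described above can be sketched in a few lines of Python. This is only a mental model under stated assumptions, not how Neo4j stores data internally: the class names `Node` and `Edge` and the `count` attribute are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                                       # e.g. "Word"
    properties: dict = field(default_factory=dict)   # any number of attributes

@dataclass
class Edge:
    label: str                                       # e.g. "NEXT"
    start: Node
    end: Node
    properties: dict = field(default_factory=dict)   # edges carry attributes too

sushi = Node("Word", {"name": "sushi"})
restaurant = Node("Word", {"name": "restaurant"})
# The "count" attribute is a hypothetical example of an edge attribute.
next_edge = Edge("NEXT", sushi, restaurant, {"count": 1})
print(next_edge.start.properties["name"], "->", next_edge.end.properties["name"])
```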
Let’s consider two sentences:
\(S_{1}\) = Where is the nearest sushi restaurant?
\(S_{2}\) = Can you please give me addresses of sushi places nearby?
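After dropping words such as “is”, “the” and “please”, each sentence reduces to a set of lowercase words:
\(S_{1}\) = {“where”,“nearest”,“sushi”,“restaurant”}
\(S_{2}\) = {“give”,“addresses”,“sushi”,“places”,“nearby”}
The words found to the left and to the right of “sushi” across both sentences are then:
left(“sushi”) = {“where”,“nearest”,“give”,“addresses”}
right(“sushi”) = {“restaurant”,“places”,“nearby”}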
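The left and right sets can be computed mechanically. The Python sketch below assumes the sentences have already been reduced to lists of lowercase words; the function name `contexts` is hypothetical.

```python
def contexts(sentences, target):
    """Collect every word seen before (left) and after (right) `target`."""
    left, right = set(), set()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            if word == target:
                left.update(tokens[:i])        # words preceding the target
                right.update(tokens[i + 1:])   # words following the target
    return left, right

s1 = ["where", "nearest", "sushi", "restaurant"]
s2 = ["give", "addresses", "sushi", "places", "nearby"]
left_sushi, right_sushi = contexts([s1, s2], "sushi")
print(sorted(left_sushi))
print(sorted(right_sushi))
```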
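Here is the Cypher query that loads the first sentence into the Neo4j graph database; the second sentence is loaded in the same way:
// Split the lowercased sentence into a list of words
WITH split(toLower("Where nearest sushi restaurant"), " ") AS text
// Walk over every pair of consecutive words
UNWIND range(0, size(text)-2) AS i
// Create each word node only once, then link it to the word that follows
MERGE (w1:Word {name: text[i]})
MERGE (w2:Word {name: text[i+1]})
MERGE (w1)-[:NEXT]->(w2)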
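For readers who do not have a Neo4j instance at hand, the same logic can be mirrored in plain Python. This is a sketch only: a set of word pairs stands in for the database, and the function name `next_edges` is hypothetical.

```python
def next_edges(sentence):
    # Lowercase, split on spaces, then pair each word with its successor,
    # like UNWIND range(0, size(text)-2) in the Cypher query.
    text = sentence.lower().split(" ")
    return [(text[i], text[i + 1]) for i in range(len(text) - 1)]

graph = set()  # like MERGE, a set stores an identical edge only once
for sentence in ["Where nearest sushi restaurant",
                 "give addresses sushi places nearby"]:
    graph.update(next_edges(sentence))

for w1, w2 in sorted(graph):
    print(f"({w1})-[:NEXT]->({w2})")
```

Both sentences pass through the shared “sushi” node, which is what later lets the graph relate “restaurant” to “places”.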
Let’s consider two new sentences:
\(S_{1}\) = My boss eats sushi on Friday
\(S_{2}\) = My brother eats pizza on Sunday evening
Sim("boss","brother") =
Sim(left("boss"), left("brother")) +
Sim(right("boss"), right("brother"))
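One way to make such a score concrete is to compare the left contexts of the two words with each other, and their right contexts with each other. The text does not say which set similarity Sim stands for; the Python sketch below assumes Jaccard similarity, keeps every word of the sentence, and uses hypothetical helper names.

```python
def contexts(tokens, target):
    # Words before and after the first occurrence of `target`.
    i = tokens.index(target)
    return set(tokens[:i]), set(tokens[i + 1:])

def jaccard(a, b):
    # Assumed set similarity: size of intersection over size of union.
    union = a | b
    return len(a & b) / len(union) if union else 0.0

s1 = "my boss eats sushi on friday".split()
s2 = "my brother eats pizza on sunday evening".split()

left_boss, right_boss = contexts(s1, "boss")
left_brother, right_brother = contexts(s2, "brother")

sim = jaccard(left_boss, left_brother) + jaccard(right_boss, right_brother)
print(round(sim, 3))
```

Here the left contexts are identical ({“my”}) and the right contexts share “eats” and “on”, so “boss” and “brother” score as similar even though they never appear in the same sentence.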
Let’s consider https://newsapi.org/, a simple and easy-to-use API that returns JSON metadata for the headlines and articles currently live all over the web. News API indexes articles from over 30,000 worldwide sources.
| author | description | publishedAt | source | title |
|---|---|---|---|---|
| http://www.abc.net.au/news/lisa-millar/166890 | In the month following Donald Trump’s inauguration it’s clear that Russians are no longer jumping down the aisles. | 2017-02-26T08:08:20Z | abc-news-au | Has Russia changed its tone towards Donald Trump? |
| http://www.abc.net.au/news/emily-sakzewski/7680548 | A fasting diet could reverse diabetes and repair the pancreas, US researches discover. | 2017-02-26T04:39:24Z | abc-news-au | Fasting diet ‘could reverse diabetes and regenerate pancreas’ |
| http://www.abc.net.au/news/jackson-vernon/7531870 | Researchers discover what could be one of the worst cases of mine pollution in the world in the heart of New South Wales’ pristine heritage-listed Blue Mountains. | 2017-02-26T02:02:28Z | abc-news-au | Mine pollution turning Blue Mountains river into ‘waste disposal’ |
| http://www.abc.net.au/news/sophie-mcneill/4516794 | Yemen is now classified as the world’s worst humanitarian disaster but Australia has committed no funding to help save lives there. | 2017-02-26T09:56:12Z | abc-news-au | Australia ignores unfolding humanitarian catastrophe in Yemen |
| http://www.abc.net.au/news/dan-conifer/5189074, http://www.abc.net.au/news/6815894 | Malcolm Turnbull and Joko Widodo hold talks in Sydney, reviving cooperation halted after the discovery of insulting posters at a military base, and reaching deals on trade and a new consulate in east Java. | 2017-02-26T03:43:04Z | abc-news-au | Australia and Indonesia agree to fully restore military ties |
| Ron Amadeo | If this is how BlackBerry wants to do hardware, we really won’t miss them. | 2017-02-25T21:00:08Z | ars-technica | BlackBerry KeyOne Hands On—BlackBerry wants $549 for mid-range device |
| Roheeni Saxena | States that legalized gay marriage early created a natural experiment. | 2017-02-25T20:00:37Z | ars-technica | Same-sex marriage linked to decline in teen suicides |
| Roheeni Saxena | We may finally be getting somewhere in our fight against the disease. | 2017-02-25T19:00:16Z | ars-technica | New malaria vaccine is fully effective in very small clinical trial |
Let’s switch from a linear and static SQL-style dataframe to a dynamic NoSQL database, i.e. from a relational to a non-relational database.
As you can spot, there are 9 variables within our dataframe coming from News API. The job is to transform each individual column into an element of a graph data model. Let’s take an example to get off to a good start.
The “author” variable becomes the red pen logo inside our new NoSQL database
The “description” variable becomes the orange comments logo inside our new NoSQL database
The “publishedAt” variable becomes the green clock logo inside our new NoSQL database
The “source” variable becomes the yellow compass logo inside our new NoSQL database
The “category” variable becomes the black compass logo inside our new NoSQL database
The “title” variable becomes the blue folder logo inside our new NoSQL database
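The column-to-node mapping listed above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: using the title as the hub node, the node labels, and the relationship names (`WRITTEN_BY`, `PUBLISHED_BY`, and so on) are all hypothetical choices, not the schema actually used in the database. The sample record is one of the rows of the table shown earlier.

```python
# Hypothetical mapping: column -> (node label, relationship type).
COLUMN_TO_GRAPH = {
    "author": ("Author", "WRITTEN_BY"),
    "description": ("Description", "DESCRIBED_BY"),
    "publishedAt": ("Date", "PUBLISHED_AT"),
    "source": ("Source", "PUBLISHED_BY"),
    "category": ("Category", "IN_CATEGORY"),
}

def row_to_graph(row):
    """Turn one flat news record into a list of nodes and a list of edges."""
    title = ("Title", row["title"])  # assumption: the title is the hub node
    nodes, edges = [title], []
    for column, (label, rel_type) in COLUMN_TO_GRAPH.items():
        value = row.get(column)
        if value is None:  # column absent from this record
            continue
        node = (label, value)
        nodes.append(node)
        edges.append((title, rel_type, node))
    return nodes, edges

row = {
    "author": "Roheeni Saxena",
    "description": "States that legalized gay marriage early created a natural experiment.",
    "publishedAt": "2017-02-25T20:00:37Z",
    "source": "ars-technica",
    "title": "Same-sex marriage linked to decline in teen suicides",
}
nodes, edges = row_to_graph(row)
for start, rel_type, end in edges:
    print(f"{start[0]} -[:{rel_type}]-> {end[0]}: {end[1]}")
```

Because nodes are identified by their label and value, two articles with the same author or source would point to the same node, which is what makes the links between news items visible.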
It should not be forgotten that the Neo4j graph just below is only an isolated implementation of a single piece of news. The time has now come to repeat the command by looping over each news item. In order to industrialise the building process of our Neo4j database, we are going to implement it on https://neo4j.com/
This big step will be the main topic of paper 2.
To give a first hint of what Neo4j is, here is the home screen of my database, in which all the news will be stored…
Just to provide some context on why we need this step, let’s consider a concrete example that highlights how important links are, both for retrieving information quickly and efficiently and for bringing to light links that remain unseen within a static and linear database.
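library(dplyr)
library(ggplot2)
library(tidytext)
library(knitr)
library(kableExtra)
# Number of news items per category
ggplot(news_data, aes(x = category, fill = category)) + geom_bar() + theme_bw()
# Keep only the descriptions of the "general" category
general_news <- news_data %>%
  filter(category == "general") %>%
  select(description)
# Split each description into bigrams (pairs of consecutive words)
general_news_bigrams <- general_news %>%
  unnest_tokens(bigram, description, token = "ngrams", n = 2) %>%
  as.data.frame()
general_news_bigrams_counts <- general_news_bigrams %>%
  count(bigram, sort = TRUE)
# Show the ten most frequent bigrams
kable(general_news_bigrams_counts[c(1:10), ]) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
| bigram | n |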
|---|---|
| of the | 2227 |
| in the | 2063 |
| on the | 1518 |
| to the | 1109 |
| the most | 846 |
| in a | 788 |
| donald trump | 770 |
| president donald | 752 |
| ap â | 713 |
| the internet | 686 |
As we can see, there are a lot of stop words, such as “of the”, “in the” or “on the”. They bring no information, and consequently we filter them out.
library(tidyr)
library(igraph)
library(ggraph)
# Split each bigram into its two words
bigrams_separated <- general_news_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")
# Drop every bigram that contains a stop word (stop_words comes from tidytext)
bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
bigram_counts <- bigrams_filtered %>%
  count(word1, word2, sort = TRUE)
# Draw the bigrams seen at least 50 times as a word network
bigram_counts %>%
  filter(n >= 50) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE,
                 point.padding = unit(0.2, "lines")) +
  theme_void()